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Abstract 



Contemporary global optimization algorithms are based on local measures of utility, rather 
than a probability measure over location and value of the optimum. They thus attempt to 
collect low function values, not to learn about the optimum. The reason for the absence 
of probabilistic global optimizers is that the corresponding inference problem is intractable 
in several ways. This paper develops desiderata for probabilistic optimization algorithms, 
then presents a concrete algorithm which addresses each of the computational intractabil- 
ities with a sequence of approximations and explicitly adresses the decision problem of 
maximizing information gain from each evaluation. 

Keywords: Optimization, Probability, Information, Gaussian Processes, Expectation 
Propagation 

1. Introduction 

Optimization problems are ubiquitous in science, engineering, and economics. Over time 
the requirements of many separate fields have led to a heterogenous set of settings and al- 
gorithms. Speaking very broadly, however, there are two distinct regimes for optimization. 
In the first one, relatively cheap function evaluations take place on a numerical machine 
and the goal is to find a "good" region of low or high function values. Noise tends to be 
small or negligible, and derivative observations are often available at low additional cost; 
but the parameter space may be very high-dimensional. This is the regime of numerical, 
local or convex optimization, often encountered as a sub-problem of machine learning al- 
gorithms. Popular algorithms for such settings include quasi-Newton methods (Broyden 
et al., 1965; Fletcher, 1970; Goldfarb, 1970; Shanno, 1970), the conjugate gradient method 
(Hestenes and Stiefel, 1952), and stochastic optimization and evolutionary search methods 
(e.g. Hansen and Ostermeier (2001)), to name only a few. Since these algorithms perform 
local search, constraints on the solution space are often a crucial part of the problem. Thor- 
ough introductions can be found in the textbooks by Nocedal and Wright (1999) and Boyd 
and Vandenberghe (2004). This paper will utilize algorithms from this domain, but it is 
not its primary subject. 

In the second milieu, which this paper addresses, the function itself is not known and 
needs to be learned during the search for its global minimum within some measurable 
(usually: bounded) domain. Here, the parameter space is often relatively low-dimensional, 
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but evaluating the function involves a monetarily or morally expensive physical process 
- building a prototype, drilling a borehole, killing a rodent, treating a patient. Noise is 
often a nontrivial issue, and derivative observations, while potentially available, cannot be 
expected in general. While algorithms for such applications need to be tractable, their 
most important desideratum is efficient use of data, rather than raw computational cost. 
This domain is often called global optimization, but is also closely associated with the field 
of experimental design and related to the concept of exploration in reinforcement learning. 
The learned model of the function is also known as a response surface in some communities. 
The two contributions of this paper are a probabilistic view on this field, and a concrete 
algorithm for such problems. 

1.1 Problem Definition 

We define the problem of probabilistic global optimization: Let / C M D be some bounded 
domain of the real vector space. There is a function / : I -> R, and our knowledge about / 
is described by a probability measure p(f) over the space of functions i" -> R. This induces 
a measure 



,{x) = p[x = arg min f(x)} = I p(f) TT 9[f(x) - f(x)\ df (1) 



were 9 is Heaviside's step function. The exact meaning of the "infinite product" over the 
entire domain / in this equation should be intuitively clear, but is defined properly in the 
Appendix. Note that the integral is over the infinite-dimensional space of functions. We 
assume we can evaluate the function 1 at any point x G / within some bounded domain /, 
obtaining function values y(x) corrupted by noise, as described by a likelihood p(y \ f(x)). 
Finally, let L(x*,x m i n ) be a loss function describing the cost of naming x* as the result of 
optimization if the true minimum is at x m ; n . This loss function induces a loss functional 
£(Pmin) assigning utility to the uncertain knowledge about x m i n , as 

£(i>min) = / [minL(x*,£ min )]p min (x min ) dx min . (2) 
J I x 

The goal of global optimization is to decrease the expected loss after H function evaluations 
at locations x = {x\, . . . , xh} C /. The expected loss is 

(C) H = J p(y\ x)C(p min (x | y, x)) dy = J J p(y \ f(x))p(f(x) \ x)£(p min (x \ y, x)) dy df 

(3) 

where £(p m i n (x \ y , x)) should be understood as the cost assigned to the measure p m i n (x) 
induced by the posterior belief over / after observations y = {yi, ■ ■ ■ ,Vh} C M at the 
locations x. 

The remainder of this paper will replace the symbolic objects in this general definition 
with concrete measures and models to construct an algorithm we call Entropy Search. But it 
is useful to pause at this point to contrast this definition with other concepts of optimization. 



1. We may further consider observations of linear operations on /. This includes derivative and integral 
observations of any order, if they exist. Section 2.8.1 addresses this point; it is unproblematic under our 
chosen prior, but clutters the notation, and is thus left out elsewhere in the paper. 
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Probabilistic Optimization The distinctive aspect of our definition of "optimization" is 
Equation (1), an explicit role for the function's extremum. Previous work did not consider 
the extremum so directly. In fact, many frameworks do not even use a measure over the 
function itself. An example of optimizers that only implicitly encode assumptions about 
the function are genetic algorithms (Schmitt, 2004) and evolutionary search (Hansen and 
Ostermeier, 2001). If such formulations feature the global minimum x m i n at all, then only in 
statements about the limit behavior of the algorithm after many evaluations. Not explicitly 
writing out the prior over the function space can have advantages: Probabilistic analyses 
tend to involve intractable integrals; a less explicit formulation thus allows to construct 
algorithms with interesting properties that would be entirely intractable from a probabilistic 
viewpoint. But non-probabilistic algorithms cannot make explicit statements about the 
location of the minimum. At best, they may be able to provide bounds. 

Fundamentally, reasoning about optimization of functions on continuous domains after 
finitely many evaluations, like any other inference task on spaces without natural measures, 
is impossible without prior assumptions. For intuition, consider the following thought ex- 
periment: Let (xQ,y ) be a finite, possibly empty, set of previously collected data. For 
simplicity, and without loss of generality, assume there was no measurement noise, so the 
true function actually passes through each data point. Say we want to suggest that the 
minimum of / may be at x* € /. To make this argument, we propose a number of functions 
that pass through (xQ,y ) and are minimized at x* . We may even suggest an uncountably 
infinite set of such functions. Whatever our proposal, a critic can always suggest another 
uncountable set of functions that also pass through the data, and are not minimized at 
x* . To argue with this person, we need to reason about the relative size of our set versus 
their set. Assigning size to infinite sets amounts to the aforementioned normalized mea- 
sure over admissible functions p(f), and the consistent way to reason with such measures 
is probability theory (Kolmogorov, 1933; Cox, 1946). Of course, this amounts to imposing 
assumptions on /, but this is a fundamental epistemological limitation of inference, not a 
special aspect of optimization. 

Relationship to the Bandit Setting There is a considerable amount of prior work on 
continuous bandit problems, also sometimes called "global optimization" (e.g. Kleinberg, 
2005; Griinewalder et al., 2010; Srinivas et al., 2010). The bandit concept differs from the 
setting defined above, and bandit regret bounds do not apply here: Bandit algorithms seek 
to minimize regret, the sum over function values at evaluation points, while probabilistic 
optimizers seek to infer the minimum, no matter what the function values at evaluation 
points. An optimizer gets to evaluate H times, then has to make one single decision regard- 
ing £(p m in)- Bandit players have to make H evaluations, such that the evaluations produce 
low values. This forces bandits to focus their evaluation policy on function value, rather 
than the loss at the horizon (see also Section 3.1). In probabilistic optimization, the only 
quantity that counts is the quality of the belief on p m i Q under C, after H evaluations, not 
the sum of the function values returned during those H steps. 

Relationship to Heuristic Gaussian Process Optimization and Response Surface 
Optimization There are also a number of works employing Gaussian process measures to 
construct heuristics for search, also known as "Gaussian process global optimization" (Jones 
et al., 1998; Lizotte, 2008; Osborne et al., 2009). As in our definition, these methods explic- 
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itly infer the function from observations, constructing a Gaussian process posterior. But 
they then evaluate at the location maximizing a heuristic u\p(f(x))] that turns the marginal 
belief over f(x) at x, which is a univariate Gaussian p(f{x)) = J\f[f(x); n(x), a 2 (x)], into an 
ad hoc utility for evaluation, designed to have high value at locations close to the function's 
minimum. Two popular heuristics are the probability of improvement (Lizotte, 2008) 

upi(x) =p[f(x) < V } = J^M(f(x);f,(x),a(x) 2 ) df(x) = $ ( ! ^^) (4) 

and expected improvement (Jones et al., 1998) 

u El (x) = E[min{0, (r, - f(x))}} = (r, - M )d> + °<t> {^^f) ( 5 ) 

where $>(z) = 1/2[1 + erf(z/\/2)] is the standard Gaussian cumulative density function, 
4>{x) = Af(x; 0, 1) is the standard Gaussian probability density function, and rj is a current 
"best guess" for a low function value, e.g. the lowest evaluation so far. 

These two heuristics have different units of measure: probability of improvement is a 
probability, expected improvement has the units of /. Both utilities differ markedly from 
Eq. (1), p m i n , which is a probability measure and as such a global quantity. See Figure 
2 for a comparison of the three concepts on an example. The advantage of the heuristic 
approach is that it is computationally lightweight, because the utilities have analytic form. 
But local measures cannot capture general decision problems of the type described above. 
For example, these algorithms do not capture the effect of evaluations on knowledge: A 
small region of high density p m i n (x) may be less interesting to explore than a broad region 
of lower density, because the expected change in knowledge from an evaluation in the 
broader region may be much larger, and may thus have much stronger effect on the loss. 
If the goal is to infer the location of the minimum (more generally: minimize loss at the 
horizon), the optimal strategy is to evaluate where we expect to learn most about the 
minimum (reduce loss toward the horizon), rather then where we think the minimum is 
(recall Section 1.1). The former is a nonlocal problem, because evaluations affect the belief, 
in general, everywhere. The latter is a local problem. 

2. Entropy Search 

The probable reason for the absence of global optimization algorithms from the literature 
is a number of intractabilities in any concrete realisation of the setting of Section 1.1. This 
section makes some choices and constructs a series of approximations, to arrive at a tangible 
algorithm, which we call Entropy Search. The derivations evolve along the following path. 

choosing p(f) We commit to a Gaussian process prior on / (Section 2.1). Limitations and 
implications of this choice are outlined, and possible extensions suggested, in Sections 
2.8.1 and 2.8.3. 

discretizing p m - m We discretize the problem of calculating p m i n , to a finite set of represen- 
ter points chosen from a non-uniform measure, which deals gracefully with the curse 
of dimensionality. Artefacts created by this discretization are studied in the tractable 
one-dimensional setting (Section 2.2). 
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Figure 1: A Gaussian process measure (rational quadratic kernel), conditioned on three 
previous observations (black crosses). Mean function in solid red, marginal stan- 
dard deviation at each location (two standard deviations) as light red tube. Five 
sampled functions from the current belief as dashed red lines. Arbitrary ordinate 
scale, zero in gray. 



approximating p m ; n We construct an efficient approximation to p m i n , which is required 
because Eq. (1), even for finite-dimensional Gaussian measures, is not analytically 
tractable, (Section 2.3). We compare the approximation to the (asymptotically exact, 
but more expensive) Monte Carlo solution. 

predicting change to p m ; n The Gaussian process measure affords a straightforward but 
rarely used analytic probabilistic formulation for the change of p(f) as a function of 
the next evaluation point (Section 2.4). 

choosing loss function We commit to relative entropy from a uniform distribution as the 
loss function, as this can be interpreted as a utility on gained information about the 
location of the minimum (Section 2.5). 

predicting expected information gain From the predicted change, we construct a first- 
order expansion on (£) from future evaluations and, again, compare to the asymptot- 
ically exact Monte Carlo answer (Section 2.6). 

choosing greedily Faced with the exponential cost of the exact dynamic problem to the 
horizon H, we accept a greedy approach for the reduction of (C) at every step. We 
illustrate the effect of this shortcut in an example setting (Section 2.7). 

2.1 Gaussian Process Measure on / 

The remainder of the paper commits to Gaussian process measures for p{f). These are 
convenient for the task at hand due to their descriptive generality and their convenient 
analytic properties. Since this paper is aimed at readers from several communities, this 
section contains a very brief introduction to some relevant aspects of Gaussian processes; 
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readers familiar with the subject can safely skip ahead. A thorough introduction can be 
found in a textbook of Rasmussen and Williams (2006). Some readers from other fields 
may find it helpful to know that more or less special cases of Gaussian process inference are 
elsewhere known under names like Kriging (Krige, 1951) and Kolmogorov-Wiener prediction 
(Wiener and Masani, 1957), but while these frameworks are essentially the same idea, 
the generality of their definitions varies, so restrictions of those frameworks should not be 
assumed to carry over to Gaussian process inference as understood in machine learning. 

A Gaussian process is an infinite-dimensional probability density, such that each linear 
finite-dimensional restriction is multivariate Gaussian. The infinite-dimensional space can 
be thought of as a space of functions, and the finite-dimensional restrictions as values of 
those functions at locations {a?*}i=i,...,jv- Gaussian process beliefs are parametrized by a 
mean function m : I -> M. and a covariance function k : I x I -> R. For our particular 
analysis, we restrict the domain / to finite, compact subsets of the real vector spaces R D . 
The covariance function, also known as the kernel, has to be positive definite, in the sense 
that any finite-dimensional matrix with elements Kij = k(xi, Xj) has to be positive definite 
\/xi,Xj e I. A number of such kernel functions are known in the literature, and differ- 
ent kernel functions induce different kinds of Gaussian process measures over the space of 
functions. Among the most widely used kernels for regression are the squared exponential 
kernel 



which induces a measure that puts nonzero mass on only smooth functions of characteristic 
length-scale S and signal variance s 2 (MacKay, 1998b), and the rational quadratic kernel 
(Matern, 1960; Rasmussen and Williams, 2006) 



which induces a belief over smooth functions whose characteristic length scales are a scale 
mixture over a distribution of width 1/a and location S. Other kernels can be used to 
induce beliefs over non-smooth functions (Matern, 1960), and even over non-continuous 
functions (Uhlenbeck and Ornstein, 1930). Experiments in this paper use the two kernels 
defined above, but the results apply to all kernels inducing beliefs over continuous functions. 
While there is a straightforward relationship between kernel continuity and the mean square 
continuity of the induced process, the relationship between the kernel function and the 
continuity of each sample is considerably more involved (Adler, 1981, §3). Regularity of 
the kernel also plays a nontrivial role in the question wether the distribution of infima of 
samples from the process is well-defined at all (Adler, 1990). In this work, we side-step 
this issue by assuming that the chosen kernel is sufficiently regular to induce a well-defined 
belief j9 m ; n as defined by Equation (26). 

Kernels form a semiring: products and sums of kernels are kernels. These operations can 
be used to generalize the induced beliefs over the function space (Section 2.8.3). Without 
loss of generality, the mean function is often set to m = in theoretical analyses, and this 
paper will keep with this tradition, except for Section 2.8.3. Where m is nonzero, its effect 
is a straightforward off-set p{f{x)) — t> p(f(x) — m{x)). 



ksE(x, x'; S, s) = s 2 exp — (x — x') 1 S 1 (x — x') 



(6) 
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For the purpose of regression, the most important aspect of Gaussian process priors 
is that they are conjugate to the likelihood from finitely many observations (X,Y) = 
{xi, yi}i=i,...,jv of the form yi(xi) = f(xi) + £ with Gaussian noise £ ~ M(0,a 2 ). The 
posterior is a Gaussian process with mean and covariance functions 

H(x*) = k x * )X [K x ,x+<? 2 I]~ l y ; ^{x*,x*) = k x *^-k x ^x[K x ,x+cr' 2 lX~ 1 kx, x « (8) 

where Kx,x is the kernel Gram matrix K x ^ x = k(xi,Xj), and other objects of the form k a ,b 

are also matrices with elements k^'^ = k(ai,bj). Finally, for what follows it is important 
to know that it is straightforward to sample "functions" (point-sets of arbitrary size from I) 
from a Gaussian process. To sample the value of a particular sample at the M locations X* , 
evaluate mean and variance function as a function of any previously collected datapoints, 
using Eq. (8), draw a vector £ ~ J| M AA(0, 1) of M random numbers i.i.d. from a standard 
one-dimensional Gaussian distribution, then evaluate 

f(X*) =n(X*) + C[X(X*,X*)YC (9) 
where the operator C denotes the Cholesky decomposition (Benoit, 1924). 

2.2 Discrete Representations for Continuous Distributions 

Having established a probability measure p(f) on the function, we turn to constructing 
the belief p m i n (x) over its minimum. Inspecting Equation (1), it becomes apparent that 
it is challenging in two ways: First, because it is an integral over an infinite-dimensional 
space, and second, because even on a finite-dimensional space it may be a hard integral for 
a particular p(f). This section deals with the former issue, the following Section 2.3 with 
the latter. 

It may seem daunting that p m in involves an infinite-dimensional integral. The crucial 
observation for a meaningful approximation in finite time is that regular functions can be 
represented meaningfully on finitely many points. If the stochastic process representing the 
belief over / is sufficiently regular, then Equation (1) can be approximated arbitrarily well 
with finitely many representer points. The discretization grid need not be regular - it may 
be sampled from any distribution which puts non-zero measure on every open neighborhood 
of /. This latter point is central to a graceful handling of the curse of dimensionality: The 
naive approach of approximately solving Equation (1) on a regular grid, in a D-dimensional 
domain, would require 0(exp(D)) points to achieve any given resolution. This is obviously 
not efficient: Just like in other numerical quadrature problems, any given resolution can be 
achieved with fewer representer points if they are chosen irregularly, with higher resolution 
in regions of greater influence on the result of integration. We thus choose to sample 
representer points from a proposal measure u, using a Markov chain Monte Carlo sampler 
(our implementation uses shrinking rank slice sampling (Thompson and Neal, 2010)). 

What is the effect of this stochastic discretization? A non-uniform quadrature measure 
u(x) for N representer locations {xj}i=i,...,jv leads to varying widths in the "steps" of the 
representing staircase function. As JV — > oo, the width of each step is approximately 
proportional to (u(xi)iV) _1 . Section 2.3 will construct a discretized q m in{xi) that is an 
approximation to the probability that / m ; n occurs within the step at Xi. So the approximate 
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Figure 2: p m i n induced by p(f) from Figure 1. p(f) repeated for reference. Blue solid line: 
Asymptotically exact representation gained from exact sampling of functions on 
a regular grid. For comparison, the plot also shows the local utilities probability 
of improvement (dashed magenta) and expected improvement (solid magenta) 
often used for Gaussian process global optimization. Blue circles: Approximate 
representation on representer points, sampled from probability of improvement 
measure. Stochastic error on sampled values, due to only asymptotically correct 
assignment of mass to samples, and varying density of points, focusing on relevant 
areas of p m m- This plot uses arbitrary scales for each object: The two heuristics 
have different units of measure, differing from that of Pmm- Notice the interesting 
features of p m \ n at the boundaries of the domain: The prior belief encodes that / 
is smooth, and puts finite probability mass on the hypothesis that / has negative 
(positive) derivative at the right (left) boundary of the domain. With nonzero 
probability, the minimum thus lies exactly on the boundary of the domain, rather 
than within a Taylor radius of it. 
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a e[f(xj) - f(xi)] 



Figure 3: Graphical model providing motivation for EP approximation on p m i Q - See text 



Pmin on this step is proportional to q m m(xi)u(xi), and can be easily normalized numerically, 
to become an approximation to p m \ n . 

How should the measure u be chosen? Unfortunately, the result of the integration, being 
a density rather than a function, is itself a function of u, and the loss- function is also part 
of the problem. So it is nontrivial to construct an optimal quadrature measure. Intuitively, 
a good proposal measure for discretization points should put high resolution on regions of / 
where the shape of p m i n has strong influence on the loss, and on its change. For our choice 
of loss function (Section 2.5), it is a good idea to choose u such that it puts high mass on 
regions of high value for p m i n . But for other functions, this need not always be the case. 

We have experimented with a number of ad-hoc choices for u, and found the aforemen- 
tioned "expected improvement" and "probability of improvement" (Section 1.1) to lead to 
reasonably good performance. We use these functions for a similar reason as their original 
authors: Because they tend to have high value in regions where p m i D is also large. To avoid 
confusion, however, note that we use these functions as unnormalized measures to sample 
discretization points for our calculation of p m i n , not as an approximation for p m { n itself, as 
was done in previous work by other authors. Defects in these heuristics have weaker effect 
on our algorithm than in the cited works: In our case, if u is not a good proposal measure, 
we simply need more samples to construct a good representation of p m i n . In the limit of 
N — > oo, all choices of u perform equally well, as long as they put nonzero mass on all open 
neighborhoods of the domain. 

2.3 Approximating p m - m with Expectation Propagation 

The previous Section 2.2 provided a way to construct a non- uniform grid of N discrete 
locations Xi, i = 1, . . . , N. The restriction of the Gaussian process belief to these locations 
is a multivariate Gaussian density with mean fi £ l w and covariance X 6 M. NxN . So 
Equation (1) reduces to a discrete probability distribution (as opposed to a density) 



This is a multivariate Gaussian integral over a half-open, convex, piecewise linearly con- 
strained integration region - a polyhedral cone. Unfortunately, such integrals are known 
to be intractable (Plackett, 1954; Lazard-Holly and Holly, 2003). However, it is possible to 
construct an effective approximation q m i n based on Expectation Propagation (EP) (Minka, 
2001): Consider the belief p(f(x)) as a "prior message" on f(x), and each of the terms in 



for details. 




N 



(10) 
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the product as one factor providing another message. This gives the graphical model shown 
in Figure 3. Running EP on this graph provides an approximate Gaussian marginal, whose 
normalisation constant q m m(xi), which EP also provides, approximates p(f | x m in = Xi). The 
EP algorithm itself is somewhat involved, and there are a number of algorithmic technical- 
ities to take into account for this particular setting. We refer interested readers to recent 
work by Cunningham et al. (2011), which gives a detailed description of these aspects. The 
cited work also establishes that, while EP's approximations to Gaussian integrals are not 
always reliable, in this particular case, where there are as many constraints as dimensions 
to the problem, the approximation is generally of high quality (see Figure 4 for an exam- 
ple). An important advantage of the EP approximation over both numerical integration 
and Monte Carlo integration (see next Section) is that it allows analytic differentiation of 
<Zmin with respect to the parameters fi and ~E (Cunningham et al., 2011; Seeger, 2008). This 
fact will become important in Section 2.6. 

The computational cost of this approximation is considerable: Each computation of 
qm.m(xi), for a given i, involves N factor updates, which each have rank 1 and thus cost 
0(N 2 ). So, overall, the cost of calculating q m \ n {x) is 0(N 4 ). This means N is effectively 
limited to well below N = 1000. Our implementation uses a default of N = 50, and can 
calculate next evaluation points in ~ 10 seconds. Once again, it is clear that this algorithm 
is not suitable for simple numerical optimization problems; but a few seconds are arguably 
an acceptable waiting time for physical optimization problems. 

2.3.1 An alternative: Sampling 

An alternative to EP is Monte Carlo integration: sample S functions exactly from the 
Gaussian belief on p(f), at cost 0(N 2 ) per sample, then find the minimum for each sample 
in O(N) time. This technique was used to generate the asymptotically exact plots in 
Figures 2 and following. It has overall cost 0(SN 3 ), and can be implemented efficiently 
using Matrix-Matrix multiplications, so each evaluation of this algorithm is considerably 
faster than EP. It also has the advantage of asymptotic exactness. But, unfortunately, 
it provides no analytic derivatives, because of strong discontinuity in the step functions 
of Eq. (1). So the choice is between a first-order expansion using EP (see Section 2.6) 
which is expensive, but provides a re-usable, differentiable function, and repeated calls to 
a cheaper, asymptotically exact sampler. In our experiments, the former option appeared 
to be considerably faster, and of acceptable approximative quality. But for relatively high- 
dimensional optimization problems, where one would expect to require relatively large N 
for acceptable discretization, the sampling approach can be expected to scale better. The 
code we plan to publish upon acceptance offers a choice between these two approaches. 

2.4 Predicting Innovation from Future Observations 

As detailed in Equation (3), the optimal choice of the next H evaluations is such that the 
expected change in the loss (C) x is minimal, i.e. effects the biggest possible expected drop 
in loss. The loss is a function of p m - m , which in turn is a function of p(f). So predicting 
change in loss requires predicting change in p(f) as a function of the next evaluation points. 
It is another convenient aspect of Gaussian processes that they allow such predictions in 
analytic form (Hennig, 2011): Let previous observations at Xq have yielded observations 
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Fi gure 4: EP-approximation to Pmin 

(green). Other plots as in previous figures. EP achieves 
good agreement with the asymptotically exact Monte Carlo approximation to 
p m in, including the point masses at the boundaries of the domain. 




Figure 5: Innovation from two observations at x = —3 and x = 3. Current belief in red, as 
in Figure 1. Samples from the belief over possible beliefs after observations at x 
in blue. For each sampled innovation, the plot also shows the induced innovated 
Pmin (lower sampling resolution as previous plots). Innovations from several (here: 
two) observations can be sampled jointly. 
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Yq. Evaluating at locations X will give new observations Y, and the mean will be given 

= k x *,Xo K x 1 ,x Y o + ( k x*,x ~ k x \x K x \ x k X(hX )x (n) 

(kx,x ~ k x , XQ K x \ x k XihX )-\Y - k x , Xo K x ) uX Y ) 
= n Q {x*) + ^{x\X)^ l (X, X)(Y - hq{X)) 

where K^'J^ = k(ai,bj) + bijO 1 . The step from the first to the second line involves an 
application of the matrix inversion lemma, the last line uses the mean and covariance 
functions conditioned on the dataset (Xo,Yq) so far. Since Y is presumed to come from 
this very Gaussian process belief, we can write 

Y = n(X)+C\E(X, X)] J ft'+au = /i(X)+C[S(X, X)+a 2 I H ] 1 ft ft, ft', u> ~ M{0, I H ), 

(12) 

and Equation (11) simplifies. An even simpler construction can be made for the covariance 
function. We find that mean and covariance function of the posterior after observations 
{X, Y) are mean and covariance function of the prior, incremented by the innovations 

Afi x ,n(x*) = X(x*,X)X-\X, X) C[S(X, X) + a 2 I H ]ft 
AS x (x*,i») = E^JOE-^X^X,^). 

The change to the mean function is stochastic, while the change to the covariance function 
is deterministic. Both innovations are functions both of X and of the evaluation points 
x*. One use of this result is to sample (C) x by sampling innovations, then evaluating the 
innovated p m i n for each innovation in an inner loop, as described in Section 2.3.1. An alter- 
native, described in the next section, is to construct an analytic first order approximation 
to (C)x from the EP prediction constructed in Section 2.3. As mentioned above, the ad- 
vantage of this latter option is that it provides an analytic function, with derivatives, which 
allows efficient numerical local optimization. 

2.5 Information Gain — the Log Loss 

To solve the decision problem of where to evaluate the function next in order to learn most 
about the location of the minimum, we need to say what it means to "learn" . Thus, we 
require a loss functional that evaluates the information content of innovated beliefs p m i n . 
This is, of course, a core idea in information theory. The seminal paper by Shannon (1948) 
showed that the negative expectation of probability logarithms, 

H\p} = -(log p) p = Pi log pi (14) 

i 

known as entropy, has a number of properties that allow its interpretation as a measure 
of uncertainty represented by a probability distribution p. It's value can be be interpreted 
as the number of natural information units an optimal compression algorithm requires 
to encode a sample from the distribution, given knowledge of the distribution. However, 
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ure 6: 1-step predicted loss improvement for the log loss (relative entropy). Upper part 
of plot as before, for reference. Monte Carlo prediction on regular grid as solid 
black line. Monte Carlo prediction from sampled irregular grid as dot-dashed 
black line. EP prediction on regular grid as black dashed line. EP prediction from 
samples as black dashed line. The minima of these functions, where the algorithm 
will evaluate next, are marked by vertical lines. While the predictions from the 
various approximations are not identical, they lead to similar next evaluation 
points. Note that these next evaluation points differ qualitatively from the choice 
of the GP optimization heuristics of Figure 2. Since each approximation is only 
tractable up a multiplicative constant, the scales of these plots are arbitrary, and 
only chosen to overlap for convenience. 
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it has since been pointed out repeatedly that this concept does not trivially generalize 
to probability densities. A density p(x) has a unit of measure so its logarithm is 

not well-defined, and one cannot simply replace summation with integration in Equation 
(14). A functional that is well-defined on probability densities and preserves many of 
the information-content interpretations of entropy (Jaynes and Bretthorst, 2003) is relative 
entropy, also known as Kullback-Leibler divergence (Kullback and Leibler, 1951). We use 
its negative value as a loss function emphasizing information gain. 

C KL (p;b) = - [ p(x)log^f\dx (15) 
J o[x) 

As base measure b we choose the uniform measure Uj(x) = I/] -1 over /, which is well- 
defined because / is presumed to be bounded 2 . With this choice, the loss is maximized (at 
C = 0) for a uniform belief over the minimum, and diverges toward negative infinity if p 
approaches a Dirac point distribution. The resulting algorithm, Entropy Search, will thus 
choose evaluation points such that it expects to move away from the uniform base measure 
toward a Dirac distribution as quickly as possible. 

The reader may wonder: What about the alternative idea of maximizing, at each eval- 
uation, entropy relative to the current p m i n ? This would only encourage the algorithm to 
attempt to change the current belief, but not necessarily in the right direction. For ex- 
ample, if the current belief puts very low mass on a certain region, an evaluation that has 
even a small chance of increasing p m i n in this region could appear more favorable than an 
alternative evaluation predicted to have a large effect on regions where the current Pmin 

has 

larger values. The point is not to just change p m i n , but to change it such that it moves 
away from the base measure. 

Recall that we approximate the density p{x) using a distribution p(xi) on a finite set 
{xi} of representer points, which define steps of width proportional, up to stochastic error, 
to an unnormalized measure u{xi). In other words, we can approximate p m i n {x) as 

p{xi)Nu(xi) f 
Pmin(x) ~ Z u = u(x) dx Xi = arg mm \\x — Xj\\. (lb) 

J {xj} 

We also note that after A samples, the unit element of measure has size, up to stochastic 
error, of Axi « u{x")N • ^° we can approximately represent the loss 



^KlXPmin; b) « - ^ p min (xi)Axi log P ™ n ^^ 



ST^a <~ \ i_~ Pmin(Xi)Nu(Xi) 



- ~ 7 Pmin fat) bg ■ 

V Z u b(xi) ( 17 ) 

E. r \ l Pmm{xi)u{xi) fZ u \ s -^ r . 
Pmm{Xi) log — + log ^ — j Pmin 

= H[p min ] - (logn)p min + (log6)p min + logZ u - log A 



Although uniform measures appeal as a natural representation of ignorance, they do encode an as- 
sumption about / being represented in a "natural" way. Under a nonlinear transformation of I, the 
distribution would not remain uniform. For example, uniform measures on the [0,1] simplex appear 
bell-shaped in the softmax basis (MacKay, 1998a). So, while b here does not represent prior knowledge 
on :r m i n per se, it does provide a unit of measure to information and as such is nontrivial. 
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which means we do not require the normalization constant Z u for optimization of £kl- For 
our uniform base measure, the third term in the last line is a constant, too; but other base 
measures would contribute nontrivially. 

2.6 First-Order Approximation to (£) 

Since EP provides analytic derivatives of p m - m with respect to mean and covariance of the 
Gaussian measure over /, we can construct a first order expansion of the expected change in 
loss from evaluations. To do so, we consider, in turn, the effect of evaluations at X on the 
measure on /, the induced change in p m m, and finally the change in C. Since the change to 
the mean is Gaussian stochastic, Ito's Lemma (Ito, 1951) applies. The following Equation 
uses the summation convention: double indices in products are summed over. 



(A£)x = J. 



M(n;o,i)dn-c\p° min ]. (is) 



The first line contains deterministic effects, the first term in the second line covers the 
stochastic aspect. Monte Carlo integration over the stochastic effects can be performed 
approximately using a small number of samples fi. These samples should be drawn only 
once, at first calculation, to get a differentiable function (AC)x that can be re-used in 
subsequent optimization steps. 

The above formulation is agnostic with respect to the loss function. Hence, in principle, 
Entropy Search should be easy to generalize to different loss functions. But recall that the 
fidelity of the calculation of Equation (18) depends on the intermediate approximate steps, 
in particular the choice of discretization measure u. We have experimented with other loss 
functions and found it difficult to find a good measure u providing good performance for 
many such loss functions. So this paper is limited to the specific choice of the relative 
entropy loss function. Generalization to other losses is future work. 



2.7 Greedy Planning, and its Defects 

The previous sections constructed a means to predict, approximately, the expected drop 
in loss from H new evaluations at locations X = {a^}^^...^. The remaining task is to 
optimize these locations. It may seem pointless to construct an optimization algorithm 
which itself contains an optimization problem, but note that this new optimization problem 
is quite different from the initial one. It is a numerical optimization problem, of the form 
described in Section 1: We can evaluate the utility function numerically, without noise, 
with derivatives, and at hopefully relatively low cost compared to the physical process we 
are ultimately trying to optimize. 

Nevertheless, one issue remains: Optimizing evaluations over the entire horizon H is a 
dynamical programming problem, which, in general, has cost exponential in H. However, 
this problem has a particular structure: Apart from the fact that evaluations drawn from 
Gaussian process measures are exchangeable, there is also other evidence that optimization 
problems are benign from the point of view of planning. For example, Srinivas et al. (2010) 
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Figure 7: Expected drop in relative entropy (see Section 2.5) from two additional evalua- 
tions to the three old evaluations shown in previous plots. First new evaluation 
on abscissa, second new evaluation on ordinate, but due to the exchangeability of 
Gaussian process measures, the plot is symmetric. Diagonal elements excluded 
for numerical reasons. Blue regions are more beneficial than red ones. The rela- 
tively complicated structure of this plot illustrates the complexity of finding the 
optimal ii-step evaluation locations. 
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show that the information gain over the function values is submodular, so that greedy 
learning of the function comes close to optimal learning of the function. While is is not 
immediately clear whether this statement extends to our issue of learning about the func- 
tion's minimum, it is obvious that the greedy choice of whatever evaluation location most 
reduces expected loss in the immediate next step is guaranteed to never be catastrophically 
wrong. In contrast to general planning, there are no "dead ends" in inference problems. 
At worst, a greedy algorithm may choose an evaluation point revealed as redundant by a 
later step. But thanks to the consistency of Bayesian inference in general, and Gaussian 
process priors in particular (van der Vaart and van Zanten, 2011), no decision can lead to an 
evaluation that somehow makes it impossible to learn the true function afterward. In our 
approximate algorithm, we thus adopt this greedy approach. It remains an open question 
for future research whether approximate planning techniques can be applied efficiently to 
improve performance in this planning problem. 

2.8 Further Issues 

This section digresses from the main line of thought to briefly touch upon some extensions 
and issues arising from the choices made in previous sections. For the most part, we point 
out well-known analytic properties and approximations that can be used to generalize the 
algorithm. Since they apply to Gaussian process regression rather than the optimizer itself, 
they will not play a role in the empirical evaluation of Section 3. 

2.8.1 Derivative Observations 

Gaussian process inference remains analytically tractable if instead of, or in addition to 
direct observations of /, we observe the result of any linear operator acting on /. This 
includes observations of the function's derivatives (Rasmussen and Williams, 2006, §9.4) 
and, with some caveats, to integral observations (Minka, 2000). The extension is pleas- 
ingly straightforward: The kernel defines covariances between function values. Covariances 
between the function and its derivatives are simply given by 



so kernel evaluations simply have to be replaced with derivatives (or integrals) of the kernel 
where required. Obviously, this operation is only valid as long as the derivatives and 
integrals in question exist for the kernel in question. Hence, all results derived in previous 
sections for optimization from function evaluations can trivially be extended to optimization 
from function and derivative observations, or from only derivative observations. 

2.8.2 Learning Hyperparameters 

Throughout this paper, we have assumed kernel and likelihood function to be given. In 
real applications, this will not usually be the case. In such situations, the hyperparameters 
defining these two functions, and if necessary a mean function, can be learned from the data, 
either by setting them to maximum likelihood values, or by full-scale Bayesian inference 
using Markov chain Monte Carlo methods. See Rasmussen and Williams (2006, §5) and 




(19) 
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Figure 8: Generalizing GP regression. Left: Samples from different priors. Right: Pos- 
teriors (mean, two standard deviations) after observing three datapoints with 
negligible noise (kernel parameters differ between the two plots), base: standard 
GP regression with Matern kernel, kernel: sum of two kernels (square exponential 
and rational quadratic) of different length scales and strengths, poly: polynomial 
(here: quadratic) mean function, lik: Non-Gaussian likelihood (here: logarithmic 
link function). The scales of both x and f(x) are functions of kernel param- 
eters, so the numerical values in this plot have relevance only relative to each 
other. Note the strong differences in both mean and covariance functions of the 
posteriors. 



Murray and Adams (2010) for details. In the latter case, the belief p{f) over the function 
is a mixture of Gaussian processes. To still be able to use the algorithm derived so far, 
we approximate this belief with a single Gaussian process by calculating expected values of 
mean and covariance function. 

Ideally, one would want to take account of this hierarchical learning process in the 
decision problem addressed by the optimizer. This adds another layer of computation com- 
plexity to the problem, and is outside of the scope of this paper. Here, we contend ourselves 
with considering the uncertainty of the Gaussian process conditioned on a particular set of 
hyperparameters. 

2.8.3 Limitations and Extensions of Gaussian Processes for Optimization 

Like any probability measure over functions, Gaussian process measures are not arbitrarily 
general. In particular, the most widely used kernels, including the two mentioned above, 
are stationary, meaning they only depend on the difference between locations, not their 
absolute values. Loosely speaking, the prior "looks the same everywhere". One may argue 
that many real optimization problems do not have this structure. For example, it may be 
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known that the function tends to have larger functions values toward the boundaries of / 
or, more vaguely, that it is roughly "bowl-shaped". Fortunately, a number of extensions 
readily suggest themselves to address such issues (Figure 8). 

Parametric Means As pointed out in Section 2.1, we are free to add any parametric 
general linear model as the mean function of the Gaussian process. 



Using Gaussian beliefs on the weights w i of this model, this model may be learned at 
the same time as the Gaussian process itself (Rasmussen and Williams, 2006, §2.7). 
Polynomials such as the quadratic <fi(x) = [x;xx J ] are beguiling in this regard, but 
they create an explicit "origin" at the center of /, and induce strong long-range cor- 
relations between opposite ends of /. This seems pathological: In most settings, 
observing the function on one end of / should not tell us much about the value at the 
opposite end of /. But we may more generally choose any feature set for the linear 
model. For example, a set of radial basis functions 4>i(x) = exp(||a; — Ci\\ 2 /if) around 
locations Cj at the rims of / can explain large function values in a region of width ti 
around such a feature, without having to predict large values at the center of /. This 
idea can be extended to a nonparametric version, described in the next point. 

Composite Kernels Since kernels form a semiring, we may sum a kernel of large length 
scale and large signal variance and a kernel of short length scale and low signal vari- 
ance. For example 

k(x,x') = k SE (x,x'; si, Si) + k m (x, x' , s 2 , S 2 , a 2 ) si > s 2 ; S l { > S^ (21) 

yields a kernel over functions that, within the bounded domain /, look like "rough 
troughs": global curvature paired with local stationary variations. A disadvantage of 
this prior is that it thinks "domes" just as likely as "bowls". An advantage is that it 
is a very flexible framework, and does not induce unwanted global correlations. 

Nonlinear Likelihoods An altogether different effect can be achieved by a non-Gaussian, 
non-linear likelihood function. For example, if / is known to be strictly positive, one 
may assume the noise model 



and learn g instead of /. Since the logarithm is a convex function, the minimum of 
the latent g is also a minium of /. Of course, this likelihood leads to a non-Gaussian 
posterior. To retain a tractable algorithm, approximate inference methods can be used 
to construct approximate Gaussian posteriors. In our example (labeled lik in Figure 
8), we used a Laplace approximation: It is straightforward to show that Equation 
(22) implies 




(20) 



P(y If) = A%;exp(sO,cr 2 ); / = exp(y) 



(22) 



d\ogp(y\g) 
dg 



9=9 



= =>g = logy 



d 2 logp(y\g) 
d 2 g 



9=9 



(23) 
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9 = log(/) 

Figure 9: Laplace approximation for a logarithmic Gaussian likelihood. True likelihood in 
red, Gaussian approximation in blue, maximum likelihood solution marked in 
grey. Four log relative values a = log (y/cr) of sample y and noise a (scaled in 
height for readability), a = — 1 (solid); a = (dash-dotted); a = 1 (dashed); 
a = 2 (dotted). The approximation is good for a » 0. 



so a Laplace approximation amounts to a heteroscedastic noise model, in which an 
observation (y,cr 2 ) is incorporated into the Gaussian process as (log(y), (a/y) 2 ). This 
approximation is valid if a <C y (see Figure 3). For functions on logarithmic scales, 
however, finding minima smaller than the noise level, at logarithmic resolution, is a 
considerably harder problem anyway. 



The right part of Figure 8 shows posteriors produced using the three approaches detailed 
above, and the base case of a single kernel with strong signal variance, when presented 
with the same three data points, with very low noise. The strong difference between the 
posteriors may be disappointing, but it is a fundamental aspect of inference: Different prior 
assumptions lead to different posteriors, and function space inference is impossible without 
priors. Each of the four beliefs shown in the Figure may be preferable over the others in 
particular situations. The polynomial mean describes functions that are almost parabolic. 
The exponential likelihood approximation is appropriate for functions with an intrinsic 
logarithmic scale. The sum kernel approach is pertinent for the search for local minima of 
globally stationary functions. Classic methods based on polynomial approximations are a 
lot more restrictive than any of the models described above. 

Perhaps the most general option is to use additional prior information I giving p(x m i n \ I) , 
independent of p(f), to encode outside information about the location of the minimum. 
Unfortunately, this is intractable in general. But it may be approached through approxi- 
mations. This option is outside of the scope of this paper, but will be the subject of future 
work. 
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Algorithm 1 Entropy Search 



1 


procedure EntropySearch(/c, I 


= P{y \ f{x)),u,H, (x,y)) 


2 


x ~ u(x, y) 


> discretize using measure u (Section 2.2) 


3 


[/x,S,A/i a .,AS a .] <- GP(M,a; 


y) f> infer function, innovation, from GP prior (2.1) 


4 


\A . «<2min O g m in C^min^ . 


EP(/i, XI) > approximate p m in (2.3) 


5 


if H=0 then 




6 


return g min 


> At horizon, return belief for final decision 


7 


else 




8 


x' <— arg vam(C) x 


> predict information gain; Eq. (18) 


9 


y' <- Evaluate (/(x')) 


> take measurement 


10 


EntropySearch(/c, I, u, H 


— 1, (x, y) U (x 1 , ?/')) > move to next evaluation 


11 


end if 




12 


end procedure 





2.9 Summary — the Entire Algorithm 

Algorithm 1 shows pseudocode for Entropy Search. It takes as input the prior, described 
by the kernel k, and the likelihood I = p(y \ f(x)), as well as the discretization measure u 
(which may itself be a function of previous data, the Horizon H, and any previously col- 
lected observations (x,y). To choose where to evaluate next, we first sample discretization 
points from u, then calculate the current Gaussian belief over / on the discretized domain, 
along with its derivatives. We construct an approximation to the belief over the minimum 
using Expectation Propagation, again with derivatives. Finally, we construct a first order 
approximation on the expected information gain from an evaluation at x' and optimize nu- 
merically. We evaluate / at this location, then the cycle repeats. Upon publication of this 
work, matlab source code for the algorithm and its sub-routines will be made available 
online. 

3. Experiments 

Figures in previous sections provided some intuition and anecdotal evidence for the effi- 
cacy of the various approximations used by Entropy Search. In this section, we compare 
the resulting algorithm to two Gaussian process global optimization heuristics: Expected 
Improvement, Probability of Improvement (Section 1.1), as well as to a continuous armed 
bandit algorithm: GP-UCB (Srinivas et al., 2010). For reference, we also compare to a 
number of numerical optimization algorithms: Trust-Region-Reflective (Coleman and Li, 
1996, 1994), Active-Set (Powell, 1978b,a), interior point (Byrd et al., 1999, 2000; Waltz 
et al., 2006), and a naively projected version of the BFGS algorithm (Broyden et al., 1965; 
Fletcher, 1970; Goldfarb, 1970; Shanno, 1970). We avoid implementation bias by using a 
uniform code framework for the three Gaussian process-based algorithms, i.e. the algorithms 
share code for the Gaussian process inference and only differ in the way they calculate their 
utility. For the local numerical algorithms, we used third party code: The projected BFGS 
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method is based on code by Carl Rasmussen 3 , the other methods come from version 6.0 of 
the optimization toolbox of matlab 4 . 

In some communities, optimization algorithms are tested on hand-crafted test functions. 
This runs the risk of introducing bias. Instead, we compare our algorithms on a number of 
functions sampled from a generative model. In the first experiment, the function is sampled 
from the model used by the GP algorithms themselves. This eliminates all model- mismatch 
issues and allows a direct comparison of other GP optimizers to the probabilistic optimizer. 
In a second experiment, the functions were sampled from a model strictly more general 
than the model used by the algorithms, to show the effect of model mismatch. 

3.1 Within-Model Comparison 

The first experiment was carried out over the 2-dimensional unit domain / = [0, l] 2 . To 
generate test functions, 1000 function values were jointly sampled from a Gaussian process 
with a squared-exponential covariance function of length scale I = 0.1 in each direction 
and unit signal variance. The resulting posterior mean was used as the test function. 
All algorithms had access to noisy evaluations of the test functions. For the benefit of 
the numerical optimizers, noise was kept relatively low: Gaussian with standard deviation 
a = 10~ 3 . All algorithms were tested on the same set of 40 test functions, all Figures in 
this section are averages over those sets of functions. It is nontrivial to provide error bars 
on these average estimates, because the data sets have no parametric distribution. But the 
regular structure of the plots, given that individual experiments were drawn i.i.d., indicates 
that there is little remaining stochastic error. 

After each function evaluation, the algorithms were asked to return a best guess for the 
minimum x m [ n . For the local algorithms, this is simply the point of their next evaluation. 
The Gaussian process based methods returned the global minimum of the mean belief over 
the function (found by local optimization with random restarts). Figure 10 shows the dif- 
ference between the global optimum of the function and the function value at the reported 
best guesses. Since the best guesses do not in general lie at a datapoint, their quality can 
actually decrease during optimization. The most obvious feature of this plot is that local 
optimization algorithms are not adept at finding global minima, which is not surprising, 
but gives an intuition for the difficulty of problems sampled from this generative model. 
The plot shows a clear advantage for Entropy Search over its competitors, even though the 
algorithm does not directly aim to optimize this particular loss function. The flattening 
out of the error of all three global optimizers toward the right is due to evaluation noise 
(recall that evaluations include Gaussian noise of standard deviation 10 -3 ). Interestingly, 
Entropy Search flattens out at an error almost an order of magnitude lower than that of 
the nearest competitor, Expected Improvement. One possible explanation for this behav- 
ior is a pathology in the classic heuristics: Both Expected Improvement and Probability 
of Improvement require a "current best guess" r], which has to be a point estimate, be- 
cause proper marginalization over an uncertain belief is not tractable. Due to noise, it can 



3. http://www.gaussianprocess.org/gpml/code/matlab/util/minimize.rn, version using BFGS: per- 
sonal communication 

4. http : //www . mathworks .de/help/toolbox/optim/rn/bsqj_zi .html 
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Figure 10: Distance of function value at optimizers' best guess for x m i n from true global 
minimum. Log scale. 



23 



Hennig and Schuler 



10° 




—4 I I I I I I I I I I 

10 20 30 40 50 60 70 80 90 100 



# of evaluations 

Figure 11: Euclidean distance of optimizers' best guess for rc m j n from truth. Log scale. 



thus happen that this best guess is overly optimistic, and the algorithm then explores too 
aggressively in later stages. 

Figure 11 shows data from the same experiments as the previous figure, but plots Eu- 
clidean distance from the true global optimum in input space, rather than in function value 
space. The results from this view are qualitatively similar to those shown in Figure 10. 

Since Entropy Search attempts to optimize information gain from evaluations, one would 
also like to compare to algorithms on the entropy loss function. However, this is challenging. 
First, the local optimization algorithms provide no probabilistic model of the function and 
can thus not provide this loss. But even for the optimization algorithms based on Gaussian 
process measures, it is challenging to evaluate this loss globally with good resolution. The 
only option is to approximately calculate entropy, using the very algorithm introduced in 
this paper. Doing so amounts to a kind of circular experiment that Entropy Search wins 
by definition, so we omit it here. 

We pointed out in Section 1.1 that the bandit setting differs considerably from the kind 
of optimization discussed in this paper, because bandit algorithms try to minimize regret, 
rather than improve an estimate of the function's optimum. To clarify this point further, 
Figure 12 shows the regret 

T 

r(T) = Y^[vt-fwm] (24) 
t=i 

for each of the algorithms. Notice that probability of improvement, which performs worst 
among the global algorithms as seen from the previous two measures of performance, 
achieves the lowest regret. The intuition here is that this heuristic focusses evaluations 
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Figure 12: Regret as a function of number of evaluations. 



on regions known to give low function values. In contrast, the actual value of the function 
at the evaluation point has no special role in Entropy Search. The utility of an evaluation 
point only depends on its expected effect on knowledge about the minimum of the function. 

Surprisingly, the one algorithm explicitly designed to achieve low regret, GP-UCB, per- 
forms worst in this comparison. This algorithm chooses evaluation points according to 
(Srinivas et al., 2010) 

x ncxt = arg mm[fi(x) - (3 1/2 a(x)] where (3 = A(D + 1) log T + C(k, 5) (25) 

X 

with T, the number of previous evaluations, D, the dimensionality of the input domain, 
and C(k,5) is a constant that depends on some analytic properties of the kernel k and a 
free parameter, < 5 < 1. We found it hard to find a good setting for this 5, which clearly 
has influence on the algorithm's performance. The results shown here represent the best 
performance over a set of 4 experiments with different choices for 5. They appear to be 
slightly worse than, but comparable to the empirical performance reported by the original 
paper on this algorithm (Srinivas et al., 2010, Figure 5a). 

3.2 Out-of-Model Comparison 

In the previous section, the algorithms attempted to find minima of functions sampled 
from the prior used by the algorithms themselves. In real applications, one can rarely 
hope to be so lucky, but hierarchical inference can be used to generalize the prior and 
construct a relatively general algorithm. But what if even the hierarchically extended prior 
class does not contain the true function? Qualitatively, it is clear that, beyond a certain 
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Figure 13: Left: A sample from the GP prior with squared exponential kernel used in 
the on-model experiments of Section 3.1. Right: Sample from prior with the 
rational quadratic kernel used for the out of model comparison of Section 3.2. 



point of model-mismatch, all algorithms can be made to perform arbitrarily badly. The 
poor performance of local optimizers (which may be interpreted as building a quadratic 
model) in the previous section is an example of this effect. In this section, we present 
results of the same kind of experiments as in the previous section, but on a set of 30 two- 
dimensional functions sampled from a Gaussian process prior with rational quadratic kernel, 
with the same length scale and signal variance as above, and scale mixture parameter a = 1 
(see Equation (7)). This means samples evolve over an infinite number of different length 
scales, including both longer and shorter scales than those covered by the priors of the 
algorithms (Figure 13). Figure 14 shows error on function values, Figure 15 Euclidean error 
in input space, Figure 16 regret. Note the different scales for the ordinate axes relative 
to the corresponding previous plots: While Entropy Search still (barely) outperforms the 
competitors, all three algorithms perform worse than before; and their errors become more 
similar to each other. However, they still manage to discover good regions in the domain, 
demonstrating a certain robustness to model-mismatch. 

4. Conclusion 

This paper presented a new probabilistic paradigm for global optimization, as an inference 
problem on the minimum of the function, rather than the problem of collecting iteratively 
lower and lower function values. We argue that this description is closer to practitioners' 
requirements than classic response surface optimization, bandit algorithms, or other, heuris- 
tic, global optimization algorithms. In the main part of the paper, we constructed Entropy 
Search, a practical probabilistic global optimization algorithm, using a series of analytic 
assumptions and numerical approximations: A particular family of priors over functions 
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Figure 14: Function value error, off-model tasks. 
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Figure 15: Error on x m i n , off-model tasks. 
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Figure 16: Regret, off-model tasks. 



(Gaussian processes); constructing the belief p m i n over the location of the minimum on an 
irregular grid to deal with the curse of dimensionality; and using Expectation Propagation 
toward an efficient analytic approximation. The Gaussian belief allows analytic probabilis- 
tic predictions of the effect of future datapoints, from which we constructed a first-order 
approximation of the expected change in relative entropy of p m \ n to a base measure. For 
completeness, we also pointed out some already known analytic properties of Gaussian 
process measures that can be used to generalize this algorithm. We showed that the re- 
sulting algorithm outperforms both directly and distantly related competitors through its 
more elaborate, probabilistic description of the problem. This increase in performance is 
exchanged for somewhat increased computational cost (Entropy Search costs are a constant 
multiple of that of classic Gaussian process global optimizers). So this algorithm is more 
suited for problems were evaluating the function itself carries considerable cost. Never- 
theless, it provides a natural description of the optimization problem, by focusing on the 
performance under a loss function at the horizon, rather than function values returned dur- 
ing the optimization process. It allows the practitioner to explicitly encode prior knowledge 
in a flexible way, and adapts its behavior to the user's loss function. 
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Appendix A. Mathematical Appendix 

The notation in Equation (1) can be read, sloppily, to mean u p m - m (x) is the probability 
that the value of / at x is lower than at any other x € For a continuous domain, 
though, there are uncountably many other x. To give more precise meaning to this notation, 
consider the following argument. Let there be a sequence of locations {a;i}i=i,...,jv> such 
that for N -> oo the density of points at each location converges to a measure m(x) 
nonzero on every open neighborhood in /. If the stochastic process p(f) is sufficiently 
regular to ensure samples are almost surely continuous (see footnote in Section 2.1), then 
almost every sample can be approximated arbitrarily well by a staircase function with 
steps of width m(xi)/N at the locations X{, in the sense that Ve > 3Nq > such that, 
ViV > No : \f(x) — /(argmiOj, J=1) )Ar \x — Xj\)\ < e, where | • | is a norm (all norms on 
finite-dimensional vector spaces are equivalent). This is the original reason why samples 
from sufficiently regular Gaussian processes can be plotted using finitely many points, in 
the way used in this paper. We now define the notation used in Equation (1) to mean the 
following limit, where it exists. 



In words: The "infinite product" is meant to be the limit of finite-dimensional integrals with 
an increasing number of factors and dimensions, where this limit exists. In doing so, we 
have side-stepped the issue of whether this limit exists for any particular Gaussian process 
(i.e. kernel function). We do so because the theory of suprema of stochastic processes is 
highly nontrivial. We refer the reader to a friendly but demanding introduction to the 
topic by Adler (1990). From our applied standpoint, the issue of whether (26) is well 
defined for a particular Gaussian prior is secondary: If it is known that the true function is 
continuous and bounded, than it has a well-defined supremum, and the prior should reflect 
this knowledge by assigning sufficiently regular beliefs. If the actual prior is such that we 
expect the function to be discontinuous, it should be clear that optimization is extremely 
challenging anyway. We conjecture that the finer details of the region between these two 
domains have little relevance for communities interested in optimization. 
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